{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第3章 分组"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>School</th>\n",
       "      <th>Class</th>\n",
       "      <th>Gender</th>\n",
       "      <th>Address</th>\n",
       "      <th>Height</th>\n",
       "      <th>Weight</th>\n",
       "      <th>Math</th>\n",
       "      <th>Physics</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1101</th>\n",
       "      <td>S_1</td>\n",
       "      <td>C_1</td>\n",
       "      <td>M</td>\n",
       "      <td>street_1</td>\n",
       "      <td>173</td>\n",
       "      <td>63</td>\n",
       "      <td>34.0</td>\n",
       "      <td>A+</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1102</th>\n",
       "      <td>S_1</td>\n",
       "      <td>C_1</td>\n",
       "      <td>F</td>\n",
       "      <td>street_2</td>\n",
       "      <td>192</td>\n",
       "      <td>73</td>\n",
       "      <td>32.5</td>\n",
       "      <td>B+</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1103</th>\n",
       "      <td>S_1</td>\n",
       "      <td>C_1</td>\n",
       "      <td>M</td>\n",
       "      <td>street_2</td>\n",
       "      <td>186</td>\n",
       "      <td>82</td>\n",
       "      <td>87.2</td>\n",
       "      <td>B+</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1104</th>\n",
       "      <td>S_1</td>\n",
       "      <td>C_1</td>\n",
       "      <td>F</td>\n",
       "      <td>street_2</td>\n",
       "      <td>167</td>\n",
       "      <td>81</td>\n",
       "      <td>80.4</td>\n",
       "      <td>B-</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1105</th>\n",
       "      <td>S_1</td>\n",
       "      <td>C_1</td>\n",
       "      <td>F</td>\n",
       "      <td>street_4</td>\n",
       "      <td>159</td>\n",
       "      <td>64</td>\n",
       "      <td>84.8</td>\n",
       "      <td>B+</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     School Class Gender   Address  Height  Weight  Math Physics\n",
       "ID                                                              \n",
       "1101    S_1   C_1      M  street_1     173      63  34.0      A+\n",
       "1102    S_1   C_1      F  street_2     192      73  32.5      B+\n",
       "1103    S_1   C_1      M  street_2     186      82  87.2      B+\n",
       "1104    S_1   C_1      F  street_2     167      81  80.4      B-\n",
       "1105    S_1   C_1      F  street_4     159      64  84.8      B+"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "df = pd.read_csv('data/table.csv',index_col='ID')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 一、SAC过程\n",
    "### 1. 内涵\n",
    "#### SAC指的是分组操作中的split-apply-combine过程\n",
    "#### 其中split指基于某一些规则，将数据拆成若干组，apply是指对每一组独立地使用函数，combine指将每一组的结果组合成某一类数据结构\n",
    "### 2. apply过程\n",
    "#### 在该过程中，我们实际往往会遇到四类问题：\n",
    "#### 整合（Aggregation）——即分组计算统计量（如求均值、求每组元素个数）\n",
    "#### 变换（Transformation）——即分组对每个单元的数据进行操作（如元素标准化）\n",
    "#### 过滤（Filtration）——即按照某些规则筛选出一些组（如选出组内某一指标小于50的组）\n",
    "#### 综合问题——即前面提及的三种问题的混合\n",
    "## 二、groupby函数\n",
    "### 1. 分组函数的基本内容：\n",
    "#### （a）根据某一列分组"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "grouped_single = df.groupby('School')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 经过groupby后会生成一个groupby对象，该对象本身不会返回任何东西，只有当相应的方法被调用才会起作用\n",
    "#### 例如取出某一个组："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>School</th>\n",
       "      <th>Class</th>\n",
       "      <th>Gender</th>\n",
       "      <th>Address</th>\n",
       "      <th>Height</th>\n",
       "      <th>Weight</th>\n",
       "      <th>Math</th>\n",
       "      <th>Physics</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1101</th>\n",
       "      <td>S_1</td>\n",
       "      <td>C_1</td>\n",
       "      <td>M</td>\n",
       "      <td>street_1</td>\n",
       "      <td>173</td>\n",
       "      <td>63</td>\n",
       "      <td>34.0</td>\n",
       "      <td>A+</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1102</th>\n",
       "      <td>S_1</td>\n",
       "      <td>C_1</td>\n",
       "      <td>F</td>\n",
       "      <td>street_2</td>\n",
       "      <td>192</td>\n",
       "      <td>73</td>\n",
       "      <td>32.5</td>\n",
       "      <td>B+</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1103</th>\n",
       "      <td>S_1</td>\n",
       "      <td>C_1</td>\n",
       "      <td>M</td>\n",
       "      <td>street_2</td>\n",
       "      <td>186</td>\n",
       "      <td>82</td>\n",
       "      <td>87.2</td>\n",
       "      <td>B+</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1104</th>\n",
       "      <td>S_1</td>\n",
       "      <td>C_1</td>\n",
       "      <td>F</td>\n",
       "      <td>street_2</td>\n",
       "      <td>167</td>\n",
       "      <td>81</td>\n",
       "      <td>80.4</td>\n",
       "      <td>B-</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1105</th>\n",
       "      <td>S_1</td>\n",
       "      <td>C_1</td>\n",
       "      <td>F</td>\n",
       "      <td>street_4</td>\n",
       "      <td>159</td>\n",
       "      <td>64</td>\n",
       "      <td>84.8</td>\n",
       "      <td>B+</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     School Class Gender   Address  Height  Weight  Math Physics\n",
       "ID                                                              \n",
       "1101    S_1   C_1      M  street_1     173      63  34.0      A+\n",
       "1102    S_1   C_1      F  street_2     192      73  32.5      B+\n",
       "1103    S_1   C_1      M  street_2     186      82  87.2      B+\n",
       "1104    S_1   C_1      F  street_2     167      81  80.4      B-\n",
       "1105    S_1   C_1      F  street_4     159      64  84.8      B+"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped_single.get_group('S_1').head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （b）根据某几列分组"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>School</th>\n",
       "      <th>Class</th>\n",
       "      <th>Gender</th>\n",
       "      <th>Address</th>\n",
       "      <th>Height</th>\n",
       "      <th>Weight</th>\n",
       "      <th>Math</th>\n",
       "      <th>Physics</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2401</th>\n",
       "      <td>S_2</td>\n",
       "      <td>C_4</td>\n",
       "      <td>F</td>\n",
       "      <td>street_2</td>\n",
       "      <td>192</td>\n",
       "      <td>62</td>\n",
       "      <td>45.3</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2402</th>\n",
       "      <td>S_2</td>\n",
       "      <td>C_4</td>\n",
       "      <td>M</td>\n",
       "      <td>street_7</td>\n",
       "      <td>166</td>\n",
       "      <td>82</td>\n",
       "      <td>48.7</td>\n",
       "      <td>B</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2403</th>\n",
       "      <td>S_2</td>\n",
       "      <td>C_4</td>\n",
       "      <td>F</td>\n",
       "      <td>street_6</td>\n",
       "      <td>158</td>\n",
       "      <td>60</td>\n",
       "      <td>59.7</td>\n",
       "      <td>B+</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2404</th>\n",
       "      <td>S_2</td>\n",
       "      <td>C_4</td>\n",
       "      <td>F</td>\n",
       "      <td>street_2</td>\n",
       "      <td>160</td>\n",
       "      <td>84</td>\n",
       "      <td>67.7</td>\n",
       "      <td>B</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2405</th>\n",
       "      <td>S_2</td>\n",
       "      <td>C_4</td>\n",
       "      <td>F</td>\n",
       "      <td>street_6</td>\n",
       "      <td>193</td>\n",
       "      <td>54</td>\n",
       "      <td>47.6</td>\n",
       "      <td>B</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     School Class Gender   Address  Height  Weight  Math Physics\n",
       "ID                                                              \n",
       "2401    S_2   C_4      F  street_2     192      62  45.3       A\n",
       "2402    S_2   C_4      M  street_7     166      82  48.7       B\n",
       "2403    S_2   C_4      F  street_6     158      60  59.7      B+\n",
       "2404    S_2   C_4      F  street_2     160      84  67.7       B\n",
       "2405    S_2   C_4      F  street_6     193      54  47.6       B"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped_mul = df.groupby(['School','Class'])\n",
    "grouped_mul.get_group(('S_2','C_4'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （c）组容量与组数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "School\n",
       "S_1    15\n",
       "S_2    20\n",
       "dtype: int64"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped_single.size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "School  Class\n",
       "S_1     C_1      5\n",
       "        C_2      5\n",
       "        C_3      5\n",
       "S_2     C_1      5\n",
       "        C_2      5\n",
       "        C_3      5\n",
       "        C_4      5\n",
       "dtype: int64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped_mul.size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped_single.ngroups"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "7"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped_mul.ngroups"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （d）组的遍历"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "S_1\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>School</th>\n",
       "      <th>Class</th>\n",
       "      <th>Gender</th>\n",
       "      <th>Address</th>\n",
       "      <th>Height</th>\n",
       "      <th>Weight</th>\n",
       "      <th>Math</th>\n",
       "      <th>Physics</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1101</th>\n",
       "      <td>S_1</td>\n",
       "      <td>C_1</td>\n",
       "      <td>M</td>\n",
       "      <td>street_1</td>\n",
       "      <td>173</td>\n",
       "      <td>63</td>\n",
       "      <td>34.0</td>\n",
       "      <td>A+</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1102</th>\n",
       "      <td>S_1</td>\n",
       "      <td>C_1</td>\n",
       "      <td>F</td>\n",
       "      <td>street_2</td>\n",
       "      <td>192</td>\n",
       "      <td>73</td>\n",
       "      <td>32.5</td>\n",
       "      <td>B+</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1103</th>\n",
       "      <td>S_1</td>\n",
       "      <td>C_1</td>\n",
       "      <td>M</td>\n",
       "      <td>street_2</td>\n",
       "      <td>186</td>\n",
       "      <td>82</td>\n",
       "      <td>87.2</td>\n",
       "      <td>B+</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1104</th>\n",
       "      <td>S_1</td>\n",
       "      <td>C_1</td>\n",
       "      <td>F</td>\n",
       "      <td>street_2</td>\n",
       "      <td>167</td>\n",
       "      <td>81</td>\n",
       "      <td>80.4</td>\n",
       "      <td>B-</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1105</th>\n",
       "      <td>S_1</td>\n",
       "      <td>C_1</td>\n",
       "      <td>F</td>\n",
       "      <td>street_4</td>\n",
       "      <td>159</td>\n",
       "      <td>64</td>\n",
       "      <td>84.8</td>\n",
       "      <td>B+</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     School Class Gender   Address  Height  Weight  Math Physics\n",
       "ID                                                              \n",
       "1101    S_1   C_1      M  street_1     173      63  34.0      A+\n",
       "1102    S_1   C_1      F  street_2     192      73  32.5      B+\n",
       "1103    S_1   C_1      M  street_2     186      82  87.2      B+\n",
       "1104    S_1   C_1      F  street_2     167      81  80.4      B-\n",
       "1105    S_1   C_1      F  street_4     159      64  84.8      B+"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "S_2\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>School</th>\n",
       "      <th>Class</th>\n",
       "      <th>Gender</th>\n",
       "      <th>Address</th>\n",
       "      <th>Height</th>\n",
       "      <th>Weight</th>\n",
       "      <th>Math</th>\n",
       "      <th>Physics</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2101</th>\n",
       "      <td>S_2</td>\n",
       "      <td>C_1</td>\n",
       "      <td>M</td>\n",
       "      <td>street_7</td>\n",
       "      <td>174</td>\n",
       "      <td>84</td>\n",
       "      <td>83.3</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2102</th>\n",
       "      <td>S_2</td>\n",
       "      <td>C_1</td>\n",
       "      <td>F</td>\n",
       "      <td>street_6</td>\n",
       "      <td>161</td>\n",
       "      <td>61</td>\n",
       "      <td>50.6</td>\n",
       "      <td>B+</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2103</th>\n",
       "      <td>S_2</td>\n",
       "      <td>C_1</td>\n",
       "      <td>M</td>\n",
       "      <td>street_4</td>\n",
       "      <td>157</td>\n",
       "      <td>61</td>\n",
       "      <td>52.5</td>\n",
       "      <td>B-</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2104</th>\n",
       "      <td>S_2</td>\n",
       "      <td>C_1</td>\n",
       "      <td>F</td>\n",
       "      <td>street_5</td>\n",
       "      <td>159</td>\n",
       "      <td>97</td>\n",
       "      <td>72.2</td>\n",
       "      <td>B+</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2105</th>\n",
       "      <td>S_2</td>\n",
       "      <td>C_1</td>\n",
       "      <td>M</td>\n",
       "      <td>street_4</td>\n",
       "      <td>170</td>\n",
       "      <td>81</td>\n",
       "      <td>34.2</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     School Class Gender   Address  Height  Weight  Math Physics\n",
       "ID                                                              \n",
       "2101    S_2   C_1      M  street_7     174      84  83.3       C\n",
       "2102    S_2   C_1      F  street_6     161      61  50.6      B+\n",
       "2103    S_2   C_1      M  street_4     157      61  52.5      B-\n",
       "2104    S_2   C_1      F  street_5     159      97  72.2      B+\n",
       "2105    S_2   C_1      M  street_4     170      81  34.2       A"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "for name,group in grouped_single:\n",
    "    print(name)\n",
    "    display(group.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （e）level参数（用于多级索引）和axis参数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>Class</th>\n",
       "      <th>Address</th>\n",
       "      <th>Height</th>\n",
       "      <th>Weight</th>\n",
       "      <th>Math</th>\n",
       "      <th>Physics</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Gender</th>\n",
       "      <th>School</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>M</th>\n",
       "      <th>S_1</th>\n",
       "      <td>C_1</td>\n",
       "      <td>street_1</td>\n",
       "      <td>173</td>\n",
       "      <td>63</td>\n",
       "      <td>34.0</td>\n",
       "      <td>A+</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>F</th>\n",
       "      <th>S_1</th>\n",
       "      <td>C_1</td>\n",
       "      <td>street_2</td>\n",
       "      <td>192</td>\n",
       "      <td>73</td>\n",
       "      <td>32.5</td>\n",
       "      <td>B+</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>M</th>\n",
       "      <th>S_1</th>\n",
       "      <td>C_1</td>\n",
       "      <td>street_2</td>\n",
       "      <td>186</td>\n",
       "      <td>82</td>\n",
       "      <td>87.2</td>\n",
       "      <td>B+</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">F</th>\n",
       "      <th>S_1</th>\n",
       "      <td>C_1</td>\n",
       "      <td>street_2</td>\n",
       "      <td>167</td>\n",
       "      <td>81</td>\n",
       "      <td>80.4</td>\n",
       "      <td>B-</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>S_1</th>\n",
       "      <td>C_1</td>\n",
       "      <td>street_4</td>\n",
       "      <td>159</td>\n",
       "      <td>64</td>\n",
       "      <td>84.8</td>\n",
       "      <td>B+</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              Class   Address  Height  Weight  Math Physics\n",
       "Gender School                                              \n",
       "M      S_1      C_1  street_1     173      63  34.0      A+\n",
       "F      S_1      C_1  street_2     192      73  32.5      B+\n",
       "M      S_1      C_1  street_2     186      82  87.2      B+\n",
       "F      S_1      C_1  street_2     167      81  80.4      B-\n",
       "       S_1      C_1  street_4     159      64  84.8      B+"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.set_index(['Gender','School']).groupby(level=1,axis=0).get_group('S_1').head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. groupby对象的特点\n",
    "#### （a）查看所有可调用的方法\n",
    "#### 由此可见，groupby对象可以使用相当多的函数，灵活程度很高"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Address', 'Class', 'Gender', 'Height', 'Math', 'Physics', 'School', 'Weight', 'agg', 'aggregate', 'all', 'any', 'apply', 'backfill', 'bfill', 'boxplot', 'corr', 'corrwith', 'count', 'cov', 'cumcount', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'dtypes', 'expanding', 'ffill', 'fillna', 'filter', 'first', 'get_group', 'groups', 'head', 'hist', 'idxmax', 'idxmin', 'indices', 'last', 'mad', 'max', 'mean', 'median', 'min', 'ndim', 'ngroup', 'ngroups', 'nth', 'nunique', 'ohlc', 'pad', 'pct_change', 'pipe', 'plot', 'prod', 'quantile', 'rank', 'resample', 'rolling', 'sem', 'shift', 'size', 'skew', 'std', 'sum', 'tail', 'take', 'transform', 'tshift', 'var']\n"
     ]
    }
   ],
   "source": [
    "print([attr for attr in dir(grouped_single) if not attr.startswith('_')])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （b）分组对象的head和first\n",
    "#### 对分组对象使用head函数，返回的是每个组的前几行，而不是数据集前几行"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>School</th>\n",
       "      <th>Class</th>\n",
       "      <th>Gender</th>\n",
       "      <th>Address</th>\n",
       "      <th>Height</th>\n",
       "      <th>Weight</th>\n",
       "      <th>Math</th>\n",
       "      <th>Physics</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1101</th>\n",
       "      <td>S_1</td>\n",
       "      <td>C_1</td>\n",
       "      <td>M</td>\n",
       "      <td>street_1</td>\n",
       "      <td>173</td>\n",
       "      <td>63</td>\n",
       "      <td>34.0</td>\n",
       "      <td>A+</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1102</th>\n",
       "      <td>S_1</td>\n",
       "      <td>C_1</td>\n",
       "      <td>F</td>\n",
       "      <td>street_2</td>\n",
       "      <td>192</td>\n",
       "      <td>73</td>\n",
       "      <td>32.5</td>\n",
       "      <td>B+</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2101</th>\n",
       "      <td>S_2</td>\n",
       "      <td>C_1</td>\n",
       "      <td>M</td>\n",
       "      <td>street_7</td>\n",
       "      <td>174</td>\n",
       "      <td>84</td>\n",
       "      <td>83.3</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2102</th>\n",
       "      <td>S_2</td>\n",
       "      <td>C_1</td>\n",
       "      <td>F</td>\n",
       "      <td>street_6</td>\n",
       "      <td>161</td>\n",
       "      <td>61</td>\n",
       "      <td>50.6</td>\n",
       "      <td>B+</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     School Class Gender   Address  Height  Weight  Math Physics\n",
       "ID                                                              \n",
       "1101    S_1   C_1      M  street_1     173      63  34.0      A+\n",
       "1102    S_1   C_1      F  street_2     192      73  32.5      B+\n",
       "2101    S_2   C_1      M  street_7     174      84  83.3       C\n",
       "2102    S_2   C_1      F  street_6     161      61  50.6      B+"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped_single.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### first显示的是以分组为索引的每组的第一个分组信息"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Class</th>\n",
       "      <th>Gender</th>\n",
       "      <th>Address</th>\n",
       "      <th>Height</th>\n",
       "      <th>Weight</th>\n",
       "      <th>Math</th>\n",
       "      <th>Physics</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>School</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>S_1</th>\n",
       "      <td>C_1</td>\n",
       "      <td>M</td>\n",
       "      <td>street_1</td>\n",
       "      <td>173</td>\n",
       "      <td>63</td>\n",
       "      <td>34.0</td>\n",
       "      <td>A+</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>S_2</th>\n",
       "      <td>C_1</td>\n",
       "      <td>M</td>\n",
       "      <td>street_7</td>\n",
       "      <td>174</td>\n",
       "      <td>84</td>\n",
       "      <td>83.3</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       Class Gender   Address  Height  Weight  Math Physics\n",
       "School                                                     \n",
       "S_1      C_1      M  street_1     173      63  34.0      A+\n",
       "S_2      C_1      M  street_7     174      84  83.3       C"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped_single.first()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （c）分组依据\n",
    "#### 对于groupby函数而言，分组的依据是非常自由的，只要是与数据框长度相同的列表即可，同时支持函数型分组"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>School</th>\n",
       "      <th>Class</th>\n",
       "      <th>Gender</th>\n",
       "      <th>Address</th>\n",
       "      <th>Height</th>\n",
       "      <th>Weight</th>\n",
       "      <th>Math</th>\n",
       "      <th>Physics</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1103</th>\n",
       "      <td>S_1</td>\n",
       "      <td>C_1</td>\n",
       "      <td>M</td>\n",
       "      <td>street_2</td>\n",
       "      <td>186</td>\n",
       "      <td>82</td>\n",
       "      <td>87.2</td>\n",
       "      <td>B+</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1201</th>\n",
       "      <td>S_1</td>\n",
       "      <td>C_2</td>\n",
       "      <td>M</td>\n",
       "      <td>street_5</td>\n",
       "      <td>188</td>\n",
       "      <td>68</td>\n",
       "      <td>97.0</td>\n",
       "      <td>A-</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1203</th>\n",
       "      <td>S_1</td>\n",
       "      <td>C_2</td>\n",
       "      <td>M</td>\n",
       "      <td>street_6</td>\n",
       "      <td>160</td>\n",
       "      <td>53</td>\n",
       "      <td>58.8</td>\n",
       "      <td>A+</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1204</th>\n",
       "      <td>S_1</td>\n",
       "      <td>C_2</td>\n",
       "      <td>F</td>\n",
       "      <td>street_5</td>\n",
       "      <td>162</td>\n",
       "      <td>63</td>\n",
       "      <td>33.8</td>\n",
       "      <td>B</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2101</th>\n",
       "      <td>S_2</td>\n",
       "      <td>C_1</td>\n",
       "      <td>M</td>\n",
       "      <td>street_7</td>\n",
       "      <td>174</td>\n",
       "      <td>84</td>\n",
       "      <td>83.3</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     School Class Gender   Address  Height  Weight  Math Physics\n",
       "ID                                                              \n",
       "1103    S_1   C_1      M  street_2     186      82  87.2      B+\n",
       "1201    S_1   C_2      M  street_5     188      68  97.0      A-\n",
       "1203    S_1   C_2      M  street_6     160      53  58.8      A+\n",
       "1204    S_1   C_2      F  street_5     162      63  33.8       B\n",
       "2101    S_2   C_1      M  street_7     174      84  83.3       C"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby(np.random.choice(['a','b','c'],df.shape[0])).get_group('a').head()\n",
    "#相当于将np.random.choice(['a','b','c'],df.shape[0])当做新的一列进行分组"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 从原理上说，我们可以看到利用函数时，传入的对象就是索引，因此根据这一特性可以做一些复杂的操作"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1101\n",
      "1102\n",
      "1103\n",
      "1104\n",
      "1105\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>School</th>\n",
       "      <th>Class</th>\n",
       "      <th>Gender</th>\n",
       "      <th>Address</th>\n",
       "      <th>Height</th>\n",
       "      <th>Weight</th>\n",
       "      <th>Math</th>\n",
       "      <th>Physics</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [School, Class, Gender, Address, Height, Weight, Math, Physics]\n",
       "Index: []"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[:5].groupby(lambda x:print(x)).head(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 根据奇偶行分组"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'偶数行': Int64Index([1102, 1104, 1201, 1203, 1205, 1302, 1304, 2101, 2103, 2105, 2202,\n",
       "             2204, 2301, 2303, 2305, 2402, 2404],\n",
       "            dtype='int64', name='ID'),\n",
       " '奇数行': Int64Index([1101, 1103, 1105, 1202, 1204, 1301, 1303, 1305, 2102, 2104, 2201,\n",
       "             2203, 2205, 2302, 2304, 2401, 2403, 2405],\n",
       "            dtype='int64', name='ID')}"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby(lambda x:'奇数行' if not df.index.get_loc(x)%2==1 else '偶数行').groups"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 如果是多层索引，那么lambda表达式中的输入就是元组，下面实现的功能为查看两所学校中男女生分别均分是否及格\n",
    "#### 注意：此处只是演示groupby的用法，实际操作不会这样写"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(('F', 'S_1'), '均分及格')\n",
      "(('F', 'S_2'), '均分及格')\n",
      "(('M', 'S_1'), '均分及格')\n",
      "(('M', 'S_2'), '均分不及格')\n"
     ]
    }
   ],
   "source": [
    "math_score = df.set_index(['Gender','School'])['Math'].sort_index()\n",
    "grouped_score = df.set_index(['Gender','School']).sort_index().\\\n",
    "            groupby(lambda x:(x,'均分及格' if math_score[x].mean()>=60 else '均分不及格'))\n",
    "for name,_ in grouped_score:print(name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （d）groupby的[]操作\n",
    "#### 可以用[]选出groupby对象的某个或者某几个列，上面的均分比较可以如下简洁地写出："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Gender  School\n",
       "F       S_1        True\n",
       "        S_2        True\n",
       "M       S_1        True\n",
       "        S_2       False\n",
       "Name: Math, dtype: bool"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby(['Gender','School'])['Math'].mean()>=60"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 用列表可选出多个属性列："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>Math</th>\n",
       "      <th>Height</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Gender</th>\n",
       "      <th>School</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">F</th>\n",
       "      <th>S_1</th>\n",
       "      <td>64.100000</td>\n",
       "      <td>173.125000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>S_2</th>\n",
       "      <td>66.427273</td>\n",
       "      <td>173.727273</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">M</th>\n",
       "      <th>S_1</th>\n",
       "      <td>63.342857</td>\n",
       "      <td>178.714286</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>S_2</th>\n",
       "      <td>51.155556</td>\n",
       "      <td>172.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                    Math      Height\n",
       "Gender School                       \n",
       "F      S_1     64.100000  173.125000\n",
       "       S_2     66.427273  173.727273\n",
       "M      S_1     63.342857  178.714286\n",
       "       S_2     51.155556  172.000000"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby(['Gender','School'])[['Math','Height']].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （e）连续型变量分组\n",
    "#### 例如利用cut函数对数学成绩分组："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Math\n",
       "(0, 40]       7\n",
       "(40, 60]     10\n",
       "(60, 80]      9\n",
       "(80, 90]      7\n",
       "(90, 100]     2\n",
       "Name: Math, dtype: int64"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bins = [0,40,60,80,90,100]\n",
    "cuts = pd.cut(df['Math'],bins=bins) #可选label添加自定义标签\n",
    "df.groupby(cuts)['Math'].count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 三、聚合、过滤和变换\n",
    "### 1. 聚合（Aggregation）\n",
    "#### （a）常用聚合函数\n",
    "#### 所谓聚合就是把一堆数，变成一个标量，因此mean/sum/size/count/std/var/sem/describe/first/last/nth/min/max都是聚合函数\n",
    "#### 为了熟悉操作，不妨验证标准误sem函数，它的计算公式是：$\\frac{组内标准差}{\\sqrt{组容量}}$，下面进行验证："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ True,  True])"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "group_m = grouped_single['Math']\n",
    "group_m.std().values/np.sqrt(group_m.count().values)== group_m.sem().values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （b）同时使用多个聚合函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sum</th>\n",
       "      <th>mean</th>\n",
       "      <th>std</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>School</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>S_1</th>\n",
       "      <td>956.2</td>\n",
       "      <td>63.746667</td>\n",
       "      <td>23.077474</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>S_2</th>\n",
       "      <td>1191.1</td>\n",
       "      <td>59.555000</td>\n",
       "      <td>17.589305</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           sum       mean        std\n",
       "School                              \n",
       "S_1      956.2  63.746667  23.077474\n",
       "S_2     1191.1  59.555000  17.589305"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "group_m.agg(['sum','mean','std'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 利用元组进行重命名"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rename_sum</th>\n",
       "      <th>rename_mean</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>School</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>S_1</th>\n",
       "      <td>956.2</td>\n",
       "      <td>63.746667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>S_2</th>\n",
       "      <td>1191.1</td>\n",
       "      <td>59.555000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        rename_sum  rename_mean\n",
       "School                         \n",
       "S_1          956.2    63.746667\n",
       "S_2         1191.1    59.555000"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "group_m.agg([('rename_sum','sum'),('rename_mean','mean')])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 指定哪些函数作用哪些列"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr:last-of-type th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th colspan=\"2\" halign=\"left\">Math</th>\n",
       "      <th>Height</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>mean</th>\n",
       "      <th>max</th>\n",
       "      <th>var</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>School</th>\n",
       "      <th>Class</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"3\" valign=\"top\">S_1</th>\n",
       "      <th>C_1</th>\n",
       "      <td>63.78</td>\n",
       "      <td>87.2</td>\n",
       "      <td>183.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C_2</th>\n",
       "      <td>64.30</td>\n",
       "      <td>97.0</td>\n",
       "      <td>132.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C_3</th>\n",
       "      <td>63.16</td>\n",
       "      <td>87.7</td>\n",
       "      <td>179.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">S_2</th>\n",
       "      <th>C_1</th>\n",
       "      <td>58.56</td>\n",
       "      <td>83.3</td>\n",
       "      <td>54.7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C_2</th>\n",
       "      <td>62.80</td>\n",
       "      <td>85.4</td>\n",
       "      <td>256.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C_3</th>\n",
       "      <td>63.06</td>\n",
       "      <td>95.5</td>\n",
       "      <td>205.7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C_4</th>\n",
       "      <td>53.80</td>\n",
       "      <td>67.7</td>\n",
       "      <td>300.2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               Math       Height\n",
       "               mean   max    var\n",
       "School Class                    \n",
       "S_1    C_1    63.78  87.2  183.3\n",
       "       C_2    64.30  97.0  132.8\n",
       "       C_3    63.16  87.7  179.2\n",
       "S_2    C_1    58.56  83.3   54.7\n",
       "       C_2    62.80  85.4  256.0\n",
       "       C_3    63.06  95.5  205.7\n",
       "       C_4    53.80  67.7  300.2"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped_mul.agg({'Math':['mean','max'],'Height':'var'})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （c）使用自定义函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1101    34.0\n",
      "1102    32.5\n",
      "1103    87.2\n",
      "1104    80.4\n",
      "1105    84.8\n",
      "Name: Math, dtype: float64 间隔\n",
      "2101    83.3\n",
      "2102    50.6\n",
      "2103    52.5\n",
      "2104    72.2\n",
      "2105    34.2\n",
      "Name: Math, dtype: float64 间隔\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "School\n",
       "S_1    None\n",
       "S_2    None\n",
       "Name: Math, dtype: object"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped_single['Math'].agg(lambda x:print(x.head(),'间隔'))\n",
    "#可以发现，agg函数的传入是分组逐列进行的，有了这个特性就可以做许多事情"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 官方没有提供极差计算的函数，但通过agg可以容易地实现组内极差计算"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "School\n",
       "S_1    65.5\n",
       "S_2    62.8\n",
       "Name: Math, dtype: float64"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped_single['Math'].agg(lambda x:x.max()-x.min())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （d）利用NamedAgg函数进行多个聚合\n",
    "#### 注意：不支持lambda函数，但是可以使用外置的def函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>min_score1</th>\n",
       "      <th>max_score1</th>\n",
       "      <th>range_score2</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>School</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>S_1</th>\n",
       "      <td>65.5</td>\n",
       "      <td>97.0</td>\n",
       "      <td>33.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>S_2</th>\n",
       "      <td>62.8</td>\n",
       "      <td>95.5</td>\n",
       "      <td>39.4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        min_score1  max_score1  range_score2\n",
       "School                                      \n",
       "S_1           65.5        97.0          33.5\n",
       "S_2           62.8        95.5          39.4"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def R1(x):\n",
    "    return x.max()-x.min()\n",
    "def R2(x):\n",
    "    return x.max()-x.median()\n",
    "grouped_single['Math'].agg(min_score1=pd.NamedAgg(column='col1', aggfunc=R1),\n",
    "                           max_score1=pd.NamedAgg(column='col2', aggfunc='max'),\n",
    "                           range_score2=pd.NamedAgg(column='col3', aggfunc=R2)).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （e）带参数的聚合函数\n",
    "#### 判断是否组内数学分数至少有一个值在50-52之间："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "School\n",
       "S_1    False\n",
       "S_2     True\n",
       "Name: Math, dtype: bool"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def f(s,low,high):\n",
    "    return s.between(low,high).max()\n",
    "grouped_single['Math'].agg(f,50,52)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 如果需要使用多个函数，并且其中至少有一个带参数，则使用wrap技巧："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>func_name</th>\n",
       "      <th>mean</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>School</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>S_1</th>\n",
       "      <td>False</td>\n",
       "      <td>63.746667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>S_2</th>\n",
       "      <td>True</td>\n",
       "      <td>59.555000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        func_name       mean\n",
       "School                      \n",
       "S_1         False  63.746667\n",
       "S_2          True  59.555000"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def f_test(s,low,high):\n",
    "    return s.between(low,high).max()\n",
    "def agg_f(f_mul,name,*args):\n",
    "    def wrapper(x):\n",
    "        return f_mul(x,*args)\n",
    "    wrapper.__name__ = name\n",
    "    return wrapper\n",
    "grouped_single['Math'].agg([agg_f(f_test,'func_name',50,52),'mean'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. 过滤（Filteration）\n",
    "#### filter函数是用来筛选某些组的（务必记住结果是组的全体），因此传入的值应当是布尔标量"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Math</th>\n",
       "      <th>Physics</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2101</th>\n",
       "      <td>83.3</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2102</th>\n",
       "      <td>50.6</td>\n",
       "      <td>B+</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2103</th>\n",
       "      <td>52.5</td>\n",
       "      <td>B-</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2104</th>\n",
       "      <td>72.2</td>\n",
       "      <td>B+</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2105</th>\n",
       "      <td>34.2</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      Math Physics\n",
       "ID                \n",
       "2101  83.3       C\n",
       "2102  50.6      B+\n",
       "2103  52.5      B-\n",
       "2104  72.2      B+\n",
       "2105  34.2       A"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped_single[['Math','Physics']].filter(lambda x:(x['Math']>32).all()).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. 变换（Transformation）\n",
    "#### （a）传入对象\n",
    "#### transform函数中传入的对象是组内的列，并且返回值需要与列长完全一致"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Math</th>\n",
       "      <th>Height</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1101</th>\n",
       "      <td>2.5</td>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1102</th>\n",
       "      <td>1.0</td>\n",
       "      <td>33</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1103</th>\n",
       "      <td>55.7</td>\n",
       "      <td>27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1104</th>\n",
       "      <td>48.9</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1105</th>\n",
       "      <td>53.3</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      Math  Height\n",
       "ID                \n",
       "1101   2.5      14\n",
       "1102   1.0      33\n",
       "1103  55.7      27\n",
       "1104  48.9       8\n",
       "1105  53.3       0"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped_single[['Math','Height']].transform(lambda x:x-x.min()).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 如果返回了标量值，那么组内的所有元素会被广播为这个值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Math</th>\n",
       "      <th>Height</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1101</th>\n",
       "      <td>63.746667</td>\n",
       "      <td>175.733333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1102</th>\n",
       "      <td>63.746667</td>\n",
       "      <td>175.733333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1103</th>\n",
       "      <td>63.746667</td>\n",
       "      <td>175.733333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1104</th>\n",
       "      <td>63.746667</td>\n",
       "      <td>175.733333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1105</th>\n",
       "      <td>63.746667</td>\n",
       "      <td>175.733333</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           Math      Height\n",
       "ID                         \n",
       "1101  63.746667  175.733333\n",
       "1102  63.746667  175.733333\n",
       "1103  63.746667  175.733333\n",
       "1104  63.746667  175.733333\n",
       "1105  63.746667  175.733333"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped_single[['Math','Height']].transform(lambda x:x.mean()).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （b）利用变换方法进行组内标准化"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Math</th>\n",
       "      <th>Height</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1101</th>\n",
       "      <td>-1.288991</td>\n",
       "      <td>-0.214991</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1102</th>\n",
       "      <td>-1.353990</td>\n",
       "      <td>1.279460</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1103</th>\n",
       "      <td>1.016287</td>\n",
       "      <td>0.807528</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1104</th>\n",
       "      <td>0.721627</td>\n",
       "      <td>-0.686923</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1105</th>\n",
       "      <td>0.912289</td>\n",
       "      <td>-1.316166</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          Math    Height\n",
       "ID                      \n",
       "1101 -1.288991 -0.214991\n",
       "1102 -1.353990  1.279460\n",
       "1103  1.016287  0.807528\n",
       "1104  0.721627 -0.686923\n",
       "1105  0.912289 -1.316166"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped_single[['Math','Height']].transform(lambda x:(x-x.mean())/x.std()).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （c）利用变换方法进行组内缺失值的均值填充"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>Math</th>\n",
       "      <th>School</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1101</td>\n",
       "      <td>34.0</td>\n",
       "      <td>S_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1102</td>\n",
       "      <td>32.5</td>\n",
       "      <td>S_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1103</td>\n",
       "      <td>87.2</td>\n",
       "      <td>S_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1104</td>\n",
       "      <td>80.4</td>\n",
       "      <td>S_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1105</td>\n",
       "      <td>84.8</td>\n",
       "      <td>S_1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     ID  Math School\n",
       "0  1101  34.0    S_1\n",
       "1  1102  32.5    S_1\n",
       "2  1103  87.2    S_1\n",
       "3  1104  80.4    S_1\n",
       "4  1105  84.8    S_1"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_nan = df[['Math','School']].copy().reset_index()\n",
    "df_nan.loc[np.random.randint(0,df.shape[0],25),['Math']]=np.nan\n",
    "df_nan.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>Math</th>\n",
       "      <th>School</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1101</td>\n",
       "      <td>34.0</td>\n",
       "      <td>S_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1102</td>\n",
       "      <td>32.5</td>\n",
       "      <td>S_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1103</td>\n",
       "      <td>87.2</td>\n",
       "      <td>S_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1104</td>\n",
       "      <td>80.4</td>\n",
       "      <td>S_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1105</td>\n",
       "      <td>84.8</td>\n",
       "      <td>S_1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     ID  Math School\n",
       "0  1101  34.0    S_1\n",
       "1  1102  32.5    S_1\n",
       "2  1103  87.2    S_1\n",
       "3  1104  80.4    S_1\n",
       "4  1105  84.8    S_1"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_nan.groupby('School').transform(lambda x: x.fillna(x.mean())).join(df.reset_index()['School']).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 四、apply函数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. apply函数的灵活性\n",
    "#### 可能在所有的分组函数中，apply是应用最为广泛的，这得益于它的灵活性：\n",
    "#### 对于传入值而言，从下面的打印内容可以看到是以分组的表传入apply中："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     School Class Gender   Address  Height  Weight  Math Physics\n",
      "ID                                                              \n",
      "1101    S_1   C_1      M  street_1     173      63  34.0      A+\n",
      "     School Class Gender   Address  Height  Weight  Math Physics\n",
      "ID                                                              \n",
      "2101    S_2   C_1      M  street_7     174      84  83.3       C\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: []\n",
       "Index: []"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby('School').apply(lambda x:print(x.head(1)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### apply函数的灵活性很大程度来源于其返回值的多样性：\n",
    "#### ① 标量返回值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>School</th>\n",
       "      <th>Math</th>\n",
       "      <th>Height</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>School</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>S_1</th>\n",
       "      <td>S_1</td>\n",
       "      <td>97.0</td>\n",
       "      <td>195</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>S_2</th>\n",
       "      <td>S_2</td>\n",
       "      <td>95.5</td>\n",
       "      <td>194</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       School  Math  Height\n",
       "School                     \n",
       "S_1       S_1  97.0     195\n",
       "S_2       S_2  95.5     194"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[['School','Math','Height']].groupby('School').apply(lambda x:x.max())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### ② 列表返回值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Math</th>\n",
       "      <th>Height</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1101</th>\n",
       "      <td>2.5</td>\n",
       "      <td>14.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1102</th>\n",
       "      <td>1.0</td>\n",
       "      <td>33.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1103</th>\n",
       "      <td>55.7</td>\n",
       "      <td>27.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1104</th>\n",
       "      <td>48.9</td>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1105</th>\n",
       "      <td>53.3</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      Math  Height\n",
       "ID                \n",
       "1101   2.5    14.0\n",
       "1102   1.0    33.0\n",
       "1103  55.7    27.0\n",
       "1104  48.9     8.0\n",
       "1105  53.3     0.0"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[['School','Math','Height']].groupby('School').apply(lambda x:x-x.min()).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### ③ 数据框返回值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>col1</th>\n",
       "      <th>col2</th>\n",
       "      <th>col3</th>\n",
       "      <th>col4</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1101</th>\n",
       "      <td>-63.0</td>\n",
       "      <td>2.5</td>\n",
       "      <td>-22</td>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1102</th>\n",
       "      <td>-64.5</td>\n",
       "      <td>1.0</td>\n",
       "      <td>-3</td>\n",
       "      <td>33</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1103</th>\n",
       "      <td>-9.8</td>\n",
       "      <td>55.7</td>\n",
       "      <td>-9</td>\n",
       "      <td>27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1104</th>\n",
       "      <td>-16.6</td>\n",
       "      <td>48.9</td>\n",
       "      <td>-28</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1105</th>\n",
       "      <td>-12.2</td>\n",
       "      <td>53.3</td>\n",
       "      <td>-36</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      col1  col2  col3  col4\n",
       "ID                          \n",
       "1101 -63.0   2.5   -22    14\n",
       "1102 -64.5   1.0    -3    33\n",
       "1103  -9.8  55.7    -9    27\n",
       "1104 -16.6  48.9   -28     8\n",
       "1105 -12.2  53.3   -36     0"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[['School','Math','Height']].groupby('School')\\\n",
    "    .apply(lambda x:pd.DataFrame({'col1':x['Math']-x['Math'].max(),\n",
    "                                  'col2':x['Math']-x['Math'].min(),\n",
    "                                  'col3':x['Height']-x['Height'].max(),\n",
    "                                  'col4':x['Height']-x['Height'].min()})).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. 用apply同时统计多个指标\n",
    "#### 此处可以借助OrderedDict工具进行快捷的统计："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>M_sum</th>\n",
       "      <th>W_var</th>\n",
       "      <th>H_mean</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>School</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>S_1</th>\n",
       "      <td>956.2</td>\n",
       "      <td>117.428571</td>\n",
       "      <td>175.733333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>S_2</th>\n",
       "      <td>1191.1</td>\n",
       "      <td>181.081579</td>\n",
       "      <td>172.950000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         M_sum       W_var      H_mean\n",
       "School                                \n",
       "S_1      956.2  117.428571  175.733333\n",
       "S_2     1191.1  181.081579  172.950000"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from collections import OrderedDict\n",
    "def f(df):\n",
    "    data = OrderedDict()\n",
    "    data['M_sum'] = df['Math'].sum()\n",
    "    data['W_var'] = df['Weight'].var()\n",
    "    data['H_mean'] = df['Height'].mean()\n",
    "    return pd.Series(data)\n",
    "grouped_single.apply(f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 五、问题与练习"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. 问题"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 【问题一】 什么是fillna的前向/后向填充，如何实现？"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 【问题二】 下面的代码实现了什么功能？请仿照设计一个它的groupby版本。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "s = pd.Series ([0, 1, 1, 0, 1, 1, 1, 0])\n",
    "s1 = s.cumsum()\n",
    "result = s.mul(s1).diff().where(lambda x: x < 0).ffill().add(s1,fill_value =0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 【问题三】 如何计算组内0.25分位数与0.75分位数？要求显示在同一张表上。\n",
    "#### 【问题四】 既然索引已经能够选出某些符合条件的子集，那么filter函数的设计有什么意义？\n",
    "#### 【问题五】 整合、变换、过滤三者在输入输出和功能上有何异同？\n",
    "#### 【问题六】 在带参数的多函数聚合时，有办法能够绕过wrap技巧实现同样功能吗？"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. 练习\n",
    "#### 【练习一】： 现有一份关于diamonds的数据集，列分别记录了克拉数、颜色、开采深度、价格，请解决下列问题："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>carat</th>\n",
       "      <th>color</th>\n",
       "      <th>depth</th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.23</td>\n",
       "      <td>E</td>\n",
       "      <td>61.5</td>\n",
       "      <td>326</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.21</td>\n",
       "      <td>E</td>\n",
       "      <td>59.8</td>\n",
       "      <td>326</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.23</td>\n",
       "      <td>E</td>\n",
       "      <td>56.9</td>\n",
       "      <td>327</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.29</td>\n",
       "      <td>I</td>\n",
       "      <td>62.4</td>\n",
       "      <td>334</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.31</td>\n",
       "      <td>J</td>\n",
       "      <td>63.3</td>\n",
       "      <td>335</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   carat color  depth  price\n",
       "0   0.23     E   61.5    326\n",
       "1   0.21     E   59.8    326\n",
       "2   0.23     E   56.9    327\n",
       "3   0.29     I   62.4    334\n",
       "4   0.31     J   63.3    335"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv('data/Diamonds.csv').head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### (a) 在所有重量超过1克拉的钻石中，价格的极差是多少？\n",
    "#### (b) 若以开采深度的0.2\\0.4\\0.6\\0.8分位数为分组依据，每一组中钻石颜色最多的是哪一种？该种颜色是组内平均而言单位重量最贵的吗？\n",
    "#### (c) 以重量分组(0-0.5,0.5-1,1-1.5,1.5-2,2+)，按递增的深度为索引排序，求每组中连续的严格递增价格序列长度的最大值。\n",
    "#### (d) 请按颜色分组，分别计算价格关于克拉数的回归系数。（单变量的简单线性回归，并只使用Pandas和Numpy完成）"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 【练习二】：有一份关于美国10年至17年的非法药物数据集，列分别记录了年份、州（5个）、县、药物类型、报告数量，请解决下列问题："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>YYYY</th>\n",
       "      <th>State</th>\n",
       "      <th>COUNTY</th>\n",
       "      <th>SubstanceName</th>\n",
       "      <th>DrugReports</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2010</td>\n",
       "      <td>VA</td>\n",
       "      <td>ACCOMACK</td>\n",
       "      <td>Propoxyphene</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2010</td>\n",
       "      <td>OH</td>\n",
       "      <td>ADAMS</td>\n",
       "      <td>Morphine</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2010</td>\n",
       "      <td>PA</td>\n",
       "      <td>ADAMS</td>\n",
       "      <td>Methadone</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2010</td>\n",
       "      <td>VA</td>\n",
       "      <td>ALEXANDRIA CITY</td>\n",
       "      <td>Heroin</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2010</td>\n",
       "      <td>PA</td>\n",
       "      <td>ALLEGHENY</td>\n",
       "      <td>Hydromorphone</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   YYYY State           COUNTY  SubstanceName  DrugReports\n",
       "0  2010    VA         ACCOMACK   Propoxyphene            1\n",
       "1  2010    OH            ADAMS       Morphine            9\n",
       "2  2010    PA            ADAMS      Methadone            2\n",
       "3  2010    VA  ALEXANDRIA CITY         Heroin            5\n",
       "4  2010    PA        ALLEGHENY  Hydromorphone            5"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv('data/Drugs.csv').head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### (a) 按照年份统计，哪个县的报告数量最多？这个县所属的州在当年也是报告数最多的吗？\n",
    "#### (b) 从14年到15年，Heroin的数量增加最多的是哪一个州？它在这个州是所有药物中增幅最大的吗？若不是，请找出符合该条件的药物。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
